On some document clustering algorithms for data mining
Authors
Abstract
We consider the problem of clustering large document sets into disjoint groups, or clusters. Our starting point is the recent literature on effective clustering algorithms, specifically Principal Direction Divisive Partitioning (PDDP), proposed by Boley in [1], and Spherical k-Means (“S–kmeans” for short), proposed by Dhillon and Modha in [4]. In this paper we study and evaluate the performance of these algorithms and propose specific refinements. We also explore the effectiveness of PDDP under various partitioning and termination rules. Finally, we present results that demonstrate the effectiveness of both PDDP and S–kmeans for computing low-rank matrix approximations.
Document clustering is heavily used in many fields, including data mining and information retrieval. The vector space model in document clustering uses an m×n matrix, e.g. a term-by-document (or term-frequency-by-document) matrix, where m is the number of terms or attributes and n is the number of documents. Clustering algorithms can be divided into “partitional” and “hierarchical”. Examples of the former are the k-means algorithm [5] and its aforementioned variation, S–kmeans. The latter, hierarchical class includes algorithms that produce clusters via a recursive “agglomerative” (bottom-up) or “divisive” (top-down) process. A recent effective divisive algorithm from this class is PDDP. k-means and its aforementioned variant are easy to implement and appear to lend themselves well to parallelization [3]. On the other hand, they are prone to converge to solutions that are only locally optimal (finding a global minimum is NP-complete) and
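As a rough illustration of the divisive idea behind PDDP, the sketch below performs a single split of a small document set. It is a minimal sketch under stated assumptions, not the implementation from [1]: the function name `pddp_split` is hypothetical, documents are stored as rows (the transpose of the m×n term-by-document matrix described above), the principal direction is approximated by power iteration rather than an SVD routine, and splitting at a projection of zero is only one of the possible partitioning rules.

```python
import math

def pddp_split(docs, iters=200):
    """Split documents into two groups along their principal direction.

    `docs` is a list of n term-weight vectors of length m (one per document).
    Returns two lists of document indices. Hypothetical helper for illustration.
    """
    n, m = len(docs), len(docs[0])
    # Centroid of the document vectors.
    w = [sum(d[i] for d in docs) / n for i in range(m)]
    # Center every document on the centroid.
    centered = [[d[i] - w[i] for i in range(m)] for d in docs]

    def project(u):
        # Projection of each centered document onto direction u.
        return [sum(c[i] * u[i] for i in range(m)) for c in centered]

    # Power iteration on the scatter matrix sum_j c_j c_j^T (assumed
    # stand-in for computing the leading singular vector with an SVD).
    u = [1.0 / (i + 1) for i in range(m)]  # arbitrary non-uniform start
    for _ in range(iters):
        proj = project(u)
        v = [sum(proj[j] * centered[j][i] for j in range(n)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break  # degenerate case: no direction to iterate on
        u = [x / norm for x in v]

    # Partitioning rule assumed here: split by the sign of the projection.
    proj = project(u)
    left = [j for j, p in enumerate(proj) if p <= 0.0]
    right = [j for j, p in enumerate(proj) if p > 0.0]
    return left, right

# Four toy documents over three terms; two use mostly the first term,
# two mostly the last.
docs = [[2, 0, 0], [3, 1, 0], [0, 0, 2], [0, 1, 3]]
left, right = pddp_split(docs)
print(left, right)
```

Applying such a split recursively to the resulting groups, together with a rule for which group to split next and when to stop, gives a top-down hierarchy of the kind the text calls divisive.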
Similar resources
Document clustering based on ontology and a fuzzy approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...
A clustering algorithm for categorical data based on a combination of measures
Clustering is one of the main techniques in data mining. Clustering is a process that classifies a data set into groups. In clustering, the data within a cluster are as close to each other as possible, and the data in two different clusters differ as much as possible. Clustering algorithms are divided into two categories according to the type of data: Clustering algorithms for numerical data and clustering algor...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Clustering and Ranking University Majors using Data Mining and AHP algorithms: The case of Iran
Abstract: Although all university majors are important and the necessity of their presence is beyond question, they might not have the same priority given the different resources and strategies that could be identified for a country. This paper focuses on clustering and ranking university majors in Iran. To do so, a model is presented to clarify the procedure. Eight different criteria are ...
A Multi-Objective Approach to Fuzzy Clustering using ITLBO Algorithm
Data clustering is one of the most important areas of research in data mining and knowledge discovery. Recent research in this area has shown that the best clustering results can be achieved using multi-objective methods. In other words, assuming more than one criterion as objective functions for clustering data can measurably increase the quality of clustering. In this study, a model with two ...
A study of the problems of the DBSCAN clustering algorithm and a review of the improvements proposed for it
Clustering is an important technique for knowledge discovery in databases. Density-based clustering algorithms are one of the main methods for clustering in data mining. These algorithms have some special features, including independence from the shape of the clusters, high understandability, and ease of use. DBSCAN is a base algorithm for density-based clustering algorithms. DBSCAN is able to...